refactor: pipeline RNG convention + warnings.warn#34
Conversation
- Add RNGRoot.numpy_child() for numpy RandomState substreams - Pipeline functions (build_v5, build_v6) accept seed: int instead of np.random.RandomState, deriving per-step RNGs via RNGRoot internally - Replace print(..., file=sys.stderr) with warnings.warn() in library code; CLI scripts keep print() for user-facing progress messages - Update tests: capsys → pytest.warns(), pass seed= kwarg - Add 3 tests for numpy_child() Closes #31, closes #32 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
This PR aligns the v5/v6 dataset pipeline RNG interface with the repo’s RNGRoot convention and replaces stderr print() warnings in library code with warnings.warn(), while updating scripts/tests to the new seed-based API.
Changes:
- Add
RNGRoot.numpy_child(name)to provide deterministic NumPy/Pandas-compatible RNG substreams. - Refactor v5/v6 pipeline functions to accept
seed: int(instead ofnp.random.RandomState) and derive per-step RNGs internally viaRNGRoot. - Replace
print(..., file=sys.stderr)withwarnings.warn(..., stacklevel=2)in pipeline code and update tests to assert warnings withpytest.warns().
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| leadforge/core/rng.py | Adds RNGRoot.numpy_child() for deterministic NumPy/Pandas RNG substreams. |
| leadforge/pipelines/build_v5.py | Switches pipeline steps to seed-based RNG derivation and replaces stderr prints with warnings.warn(). |
| leadforge/pipelines/build_v6.py | Same refactor as v5, including seed-based RNG derivation and warning emission. |
| scripts/build_v5_snapshot.py | Updates CLI orchestration to pass seed to pipeline functions (keeps CLI stderr progress prints). |
| scripts/build_v6_snapshot.py | Same as v5 script: pass seed to pipeline functions and remove unused NumPy RNG. |
| tests/core/test_rng.py | Adds tests for numpy_child() determinism/independence. |
| tests/scripts/test_build_v5_snapshot.py | Updates tests to pass seed= and assert warnings via pytest.warns() instead of capsys. |
| tests/scripts/test_build_v6_snapshot.py | Updates tests to pass seed= across v6 pipeline steps. |
| .agent-plan.md | Documents completion checklist for the refactor work. |
Comments suppressed due to low confidence (1)
leadforge/pipelines/build_v6.py:173
- The
assign_acquisition_wavedocstring says the noise is added “at the boundaries”, but the implementation applies ~5% noise across all rows (noise_mask = rng.random(n) < 0.05). Either constrain the noise mask to boundary regions or update the docstring so it matches the actual behavior.
"""Assign acquisition_wave (A, B, C) based on lead index position.
Waves A/B/C are roughly chronological: first third = A, middle = B,
last third = C. A small amount of noise is added at the boundaries.
"""
rng = RNGRoot(seed).numpy_child("acquisition_wave")
df = df.copy()
n = len(df)
waves = np.empty(n, dtype=object)
third = n // 3
# Base assignment by position (stable across seeds)
waves[:third] = "A"
waves[third : 2 * third] = "B"
waves[2 * third :] = "C"
# Add ~5% boundary noise so it's not perfectly deterministic
noise_mask = rng.random(n) < 0.05
noise_vals = rng.choice(["A", "B", "C"], size=n)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… import - Regenerate v6 CSVs to match new per-function RNG derivation (old CSVs were stale after RNG refactor changed random streams) - Re-validate: all mandatory checks pass with updated metrics - Update RELEASE_v6.md with new baseline numbers - Fix misleading numpy_child() docstring: clarify it uses 4-byte truncation (not 8-byte like child()), producing independent streams - Make numpy import lazy in core/rng.py to keep the module importable without numpy as a hard dependency - Document RandomState-vs-Generator choice in docstring (migration deferred, tracked separately) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
CI ruff 0.15.12 enforces PT030: pytest.warns() must have match parameter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
pr-agent-context report: This run includes an unresolved review comment on PR #34 in repository https://github.com/leadforge-dev/leadforge
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: leadforge/core/rng.py
URL: https://github.com/leadforge-dev/leadforge/pull/34#discussion_r3170991378
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
`numpy_child()` docstring says it uses the same derivation as `child()`, but the implementation re-hashes and truncates to 4 bytes (vs `child()` truncating to 8). To keep RNG derivation consistent across the repo, consider deriving the numpy seed from `self.child(name)` (e.g., `getrandbits(32)`/`getrandbits(64)` then adapt to RandomState) and update the docstring to reflect the actual truncation/derivation rules.Run metadata: |
There was a problem hiding this comment.
Pull request overview
This PR aligns the dataset pipeline code with the project’s RNG conventions and replaces stderr print() warnings in library code with warnings.warn(), while updating scripts and tests accordingly.
Changes:
- Added
RNGRoot.numpy_child(name)to produce deterministic NumPyRandomStatesubstreams for pandas/numpy operations. - Refactored v5/v6 pipeline step functions to accept
seed: intand internally derive named RNG substreams viaRNGRoot. - Replaced
print(..., file=sys.stderr)warnings in library pipeline steps withwarnings.warn(..., stacklevel=2)and updated tests to assert warnings withpytest.warns().
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| leadforge/core/rng.py | Adds numpy_child() for deterministic NumPy RNG substreams derived from the root seed. |
| leadforge/pipelines/build_v5.py | Pipeline steps now take seed and use warnings.warn() instead of stderr prints. |
| leadforge/pipelines/build_v6.py | Same refactor as v5, including per-step named numpy RNG substreams. |
| scripts/build_v5_snapshot.py | Script now passes seed into pipeline steps; removes local RandomState. |
| scripts/build_v6_snapshot.py | Script now passes seed into pipeline steps; removes local RandomState. |
| tests/core/test_rng.py | Adds tests covering numpy_child() type + determinism + name independence. |
| tests/scripts/test_build_v5_snapshot.py | Updates calls to pass seed= and switches stderr assertions to pytest.warns(). |
| tests/scripts/test_build_v6_snapshot.py | Updates calls to pass seed= for all stochastic pipeline steps. |
| lead_scoring_intro/RELEASE_v6.md | Updates reported metrics/uplift values for v6 release notes. |
| .agent-plan.md | Updates plan/checklist with the new pipeline RNG + warnings refactor and new metrics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| | ROC-AUC | 0.667 | | ||
| | PR-AUC | 0.429 | | ||
| | Base rate | 30.0% | | ||
| | Precision@25 | 0.480 (Lift: 1.60x) | | ||
| | Precision@50 | 0.420 (Lift: 1.40x) | | ||
| | Precision@25 | 0.520 (Lift: 1.73x) | | ||
| | Precision@50 | 0.480 (Lift: 1.60x) | |
| | Model | Mean AUC | vs LR | | ||
| |---|---|---| | ||
| | Logistic Regression | 0.658 | — | | ||
| | GBM (100 trees) | 0.680 | +0.022 | | ||
| | Logistic Regression | 0.627 | — | | ||
| | GBM (100 trees) | 0.643 | +0.016 | |
| | 50 | $1,379,208 | $1,949,380 | +41.3% | | ||
| | 25 | $1,203,430 | $1,380,990 | +14.8% | | ||
| | 50 | $1,809,281 | $2,014,459 | +11.3% | | ||
|
|
* refactor: wire sample_hidden_graph RNG through RNGRoot Change sample_hidden_graph() to accept an RNGRoot instance instead of a bare int seed, aligning it with the cross-component RNG convention established in PR #34. The old int-seed interface is preserved with a deprecation warning for backward compatibility. Closes #9 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refactor: drop backward-compat shim, clean up signature Address self-review feedback: - Remove unnecessary deprecation warnings — sample_hidden_graph is internal, no external consumers need backward compat - Make rng_root a required RNGRoot parameter (no union type) - Add strict TypeError for non-RNGRoot input - Create RNGRoot once in generator.py rather than inline - Replace deprecation tests with proper type-error tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
RNGRoot.numpy_child()method for numpyRandomStatesubstreams, bridging the project's RNG convention with pandas/numpy stochastic operationsbuild_v5,build_v6) now acceptseed: intinstead ofnp.random.RandomState, deriving per-step RNGs viaRNGRootinternallyprint(..., file=sys.stderr)withwarnings.warn()in library code; CLI scripts keepprint()for user-facing progresscapsys→pytest.warns(), passseed=kwargCloses #31, closes #32
Test plan
numpy_child, updated v5/v6 pipeline tests)ruff check+ruff formatcleanpytest.warns(UserWarning, match=...)for specificity🤖 Generated with Claude Code